Algorithmic optimization tips |
Users don't just like filters. Users like fast filters. When optimizing a filter, you should always begin with high-level optimizations before resorting to dirty C and assembly tricks.
Separable filters |
Always strive to lower the complexity of your filter. The most common filters are convolution filters; these usually involve grabbing an area of source pixels around the desired pixel, multiplying each by a coefficient, and summing the scaled values to produce a destination pixel. Unfortunately, this is an expensive operation, typically turning non-realtime after 10 or so coefficients. Check to make sure that you can't optimize the kernel, especially if the kernel is constant. Take this kernel for example:
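In C, the operation just described can be sketched as a generic kernel loop. This is a minimal illustration, not VirtualDub's filter API: it assumes an 8-bit grayscale bitmap, a pitch given in pixels, and a pixel far enough from the edges that no clipping is needed; the function name and parameters are hypothetical.

```c
#include <stdint.h>

/* Brute-force 2D convolution of one interior pixel (hypothetical helper).
   src points at the center pixel, pitch is the row stride in pixels,
   kernel is (2*radius+1) x (2*radius+1), and scale normalizes the sum. */
static uint8_t convolve_pixel(const uint8_t *src, int pitch,
                              const int *kernel, int radius, int scale) {
    int k = 2 * radius + 1;
    int sum = 0;
    for (int ky = -radius; ky <= radius; ++ky)
        for (int kx = -radius; kx <= radius; ++kx)
            sum += kernel[(ky + radius) * k + (kx + radius)]
                 * src[ky * pitch + kx];
    return (uint8_t)(sum / scale);
}
```

For a radius-2 kernel this is exactly the 25 loads, 25 multiplies, 24 adds, and 1 divide counted below.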
1  4  6  4  1
4 16 24 16  4
6 24 36 24  6
4 16 24 16  4
1  4  6  4  1
It's a pretty simple kernel; when normalized by dividing by 256, it acts as a simple radius-2 blur filter. A quick implementation requires 25 loads, 25 multiplies, 24 adds, and 1 divide. We can apply the obvious binary optimizations to reduce the code to 25 loads, 12 left shifts, 9 multiplies, 24 adds, and 1 right shift. Collecting common coefficients reduces the count further to 25 loads, 2 left shifts, 3 multiplies, 24 adds, and 1 right shift. This looks much better than the brute-force filter, so we can begin low-level optimization, right?
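As a sketch of what the collected-coefficient version looks like, here is the computation for one interior pixel, again assuming 8-bit grayscale with a pitch in pixels (the function name is made up). Pixels that share a kernel coefficient are summed first, and each group is then scaled once: 2 left shifts, 3 multiplies, 24 adds, and the final right shift.

```c
#include <stdint.h>

/* Radius-2 blur for one interior pixel, with common coefficients
   collected. p points at the center pixel; pitch is in pixels. */
static uint8_t blur5x5_pixel(const uint8_t *p, int pitch) {
    int c1  = p[-2*pitch-2] + p[-2*pitch+2] + p[2*pitch-2] + p[2*pitch+2];
    int c4  = p[-2*pitch-1] + p[-2*pitch+1] + p[-pitch-2] + p[-pitch+2]
            + p[ pitch-2]   + p[ pitch+2]   + p[2*pitch-1] + p[2*pitch+1];
    int c6  = p[-2*pitch] + p[-2] + p[2] + p[2*pitch];
    int c16 = p[-pitch-1] + p[-pitch+1] + p[pitch-1] + p[pitch+1];
    int c24 = p[-pitch] + p[-1] + p[1] + p[pitch];
    int c36 = p[0];
    return (uint8_t)((c1 + (c4 << 2) + c6*6 + (c16 << 4) + c24*24 + c36*36) >> 8);
}
```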
Wrong. In fact, this is a very slow radius-2 filter. Why? Because this filter is separable into a row and a column filter:
Row filter:     1 4 6 4 1

Column filter:  1
                4
                6
                4
                1
It doesn't matter whether you apply the row or column transform first to the image. Applying the same algorithmic optimizations above to these two filters, we get 10 loads, 2 left shifts, 8 adds, 2 multiplies, and 1 right shift for each pixel. That is an order of magnitude better than the original 2D filter -- and the gains go up dramatically for larger kernels. Always check to see if your filter is separable before coding it.
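One pass of the separable version might look like the following sketch (hypothetical names; 8-bit grayscale source, 16-bit intermediate so the unnormalized sums cannot overflow). Per pixel it costs 5 loads, 1 left shift, 1 multiply, and 4 adds; running it once horizontally and once vertically, with the >> 8 folded into the second pass, gives the totals above.

```c
#include <stdint.h>

/* One horizontal pass of the 1-4-6-4-1 filter over a single row.
   Writes unnormalized 16-bit sums; edge pixels are skipped for brevity. */
static void blur_row(const uint8_t *src, uint16_t *dst, int w) {
    for (int x = 2; x < w - 2; ++x)
        dst[x] = (uint16_t)(src[x-2] + src[x+2]          /* coefficient 1 */
               + ((src[x-1] + src[x+1]) << 2)            /* coefficient 4 */
               + src[x] * 6);                            /* coefficient 6 */
}
```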
Never look down |
Fine. You rework your radius-2 blur implementation to use a row and column transform, and guess what... it's slower than sin. Yet, VirtualDub's "blur more" filter is implemented the same way, and it runs much faster. Must be the MMX optimizations, right?
Nope. The most likely cause is that you implemented the column filter as the mirror of the row filter, by scanning down the image. This is very bad, because the L1 cache on an x86 CPU is typically only 2- or 4-way set associative -- a value at a given address can only occupy one of a few cache slots, and the mapping of addresses to slots repeats. For a 320-pixel-wide image, you will start reusing the same cache slots every 16 rows. This means that by the time you have completed one column, the entries at the top of the image will have been flushed out and will have to be reloaded from the L2 cache... and this will repeat for every column in your image. Never scan vertically down a bitmap if you have a choice. Instead, rework the column filter to generate pixels horizontally. This will be much faster.
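A sketch of such a horizontally-scanning column pass, assuming the row pass has already written 16-bit intermediate sums (names hypothetical): the outer loop walks rows, and the inner loop reads the five vertical taps for each x while streaming left to right, so no column is ever walked top to bottom.

```c
#include <stdint.h>

/* Cache-friendly column pass of the 1-4-6-4-1 filter. src holds the
   unnormalized 16-bit row-pass output; pitch is the row stride in
   elements. The >> 8 normalizes the combined 5x5 kernel. */
static void blur_col(const uint16_t *src, uint8_t *dst,
                     int w, int h, int pitch) {
    for (int y = 2; y < h - 2; ++y) {
        const uint16_t *s = src + y * pitch;
        uint8_t *d = dst + y * pitch;
        for (int x = 0; x < w; ++x)
            d[x] = (uint8_t)((s[x - 2*pitch] + s[x + 2*pitch]
                 + ((s[x - pitch] + s[x + pitch]) << 2)
                 + s[x] * 6) >> 8);
    }
}
```

Note that the inner loop touches five consecutive rows of the intermediate buffer at once, all of which stay resident in cache as x advances.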
Why is this on the algorithmic optimizations page? Because you need to take it into account when designing your algorithm. It is difficult to correct vertical scanning in a production filter, especially once it has been low-level optimized. Horizontal scanning may actually be easier to code, too: in some image processing algorithms, the filter kernel of the column transform changes depending on the row, so when you switch the scanning direction to horizontal, the coefficients become invariant across each row. (Obviously, you won't want to rework the row transform to take advantage of this same optimization.) For an example of this trick, see VirtualDub's resize filter.
VirtualDub external filter SDK 1.05 | ©1999-2001 Avery Lee <phaeron@virtualdub.org> |